
Appendix A Theory

Neural Information Processing Systems

In this section, we show the proofs of the results in the main body. Eq. (1) satisfies the triangle inequality, i.e., for any scoring functions For the second inequality, we prove it similarly. Before we present the proof of the theorem, we first provide some lemmas. By applying Lemma A.2, the following holds with probability at least $1 - \alpha$: null R F). Thus we have: null R A.1, we can get that the margin loss satisfies the triangle inequality. By Lemma A.4, we have R By Theorem 4.4, the following holds for any Based on Theorem A.6, the following standard error bound for gradual AST can be derived similarly to Corollary 4.6.


Supplementary Material

Neural Information Processing Systems

The color has been normalized to lie between 0 and 1, which does not affect the clustering or visualization. We can see that output representations from later layers yield more patterned kernel matrices with more erratic clustering.


Recommending Composite Items Using Multi-Level Preference Information: A Joint Interaction Modeling Approach

Bi, Xuan, Wang, Yaqiong, Adomavicius, Gediminas, Curley, Shawn

arXiv.org Machine Learning

Recommender systems have become ubiquitous across a wide range of fields, such as ecommerce, media consumption (including movies, books, music, news, etc.), social networks, finance, and many others, due to their effectiveness in identifying relevant items or content among numerous choices [1, 2]. Traditionally, recommender systems, largely based on collaborative filtering techniques, have focused on recommending individual (or "atomic") items, such as movies or books, by understanding users' preferences for these individual items. However, in certain application domains, recommending "composite" items (i.e., combinations of atomic items) represents a very important capability. For illustration, consider a clothing/fashion recommender system, where we want to recommend "outfits" - combinations of tops (t-shirts, shirts, sweaters) and bottoms (pants, skirts, shorts) - to users. In such a case, multiple fashion items in a recommended outfit ideally have to match both functionally and stylistically, which may require domain expertise (e.g., on things like style compatibility) beyond individual preferences. Another key challenge for such recommender systems is that a given user's personal preference for a composite item may not directly translate to the user's personal preferences for the underlying atomic items and vice versa.


MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making

Neural Information Processing Systems

Foundation models are becoming valuable tools in medicine. Yet despite their promise, the best way to leverage Large Language Models (LLMs) in complex medical tasks remains an open question. We introduce a novel multi-agent framework, named **M**edical **D**ecision-making **Agents** (**MDAgents**), that helps to address this gap by automatically assigning a collaboration structure to a team of LLMs. The assigned solo or group collaboration structure is tailored to the medical task at hand, a simple emulation inspired by the way real-world medical decision-making processes are adapted to tasks of different complexities. We evaluate our framework and baseline methods using state-of-the-art LLMs across a suite of real-world medical knowledge and clinical diagnosis benchmarks, including a comparison of LLMs' medical complexity classification against human physicians. MDAgents achieved the **best performance in seven out of ten** benchmarks on tasks requiring an understanding of medical knowledge and multi-modal reasoning, showing a significant **improvement of up to 4.2%** ($p$ < 0.05) compared to previous methods' best performances. Ablation studies reveal that MDAgents effectively determines medical complexity to optimize for efficiency and accuracy across diverse medical tasks. Notably, the combination of moderator review and external medical knowledge in group collaboration resulted in an average accuracy **improvement of 11.8%**.


Clair Obscur sweeps The Game Awards with nine wins

BBC News

Clair Obscur: Expedition 33 has been named game of the year in a record-breaking haul at this year's Game Awards. The French-developed role-playing game (RPG) cleaned up in nine of the 10 categories it was up for, with further wins in best narrative, best music and best performance. It fended off competition from Death Stranding 2, Nintendo platformer Donkey Kong Bananza, indie games Hollow Knight: Silksong and Hades 2, and medieval adventure Kingdom Come: Deliverance 2 to claim the top prize. During the ceremony in Los Angeles, players also got their first glimpses of two new Tomb Raider games, sequel Control Resonant and a new Star Wars role-playing game. Clair Obscur is set in a world where a supernatural being known as The Paintress prevents the population from growing past a certain age.


MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

Mi, Yapeng, Li, Hengli, Zhao, Yanpeng, Li, Chenxi, Wu, Huimin, Ma, Xiaojian, Zhu, Song-Chun, Wu, Ying Nian, Li, Qing

arXiv.org Artificial Intelligence

Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.


$\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion

Zhan, Zhihao, Zhou, Jiaying, Zhang, Likui, Lv, Qinhan, Liu, Hao, Zhang, Jusheng, Li, Weizheng, Chen, Ziliang, Chen, Tianshui, Wang, Keze, Lin, Liang, Wang, Guangrun

arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, E0 offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and (2) discrete diffusion matches the true quantized nature of real-world robot control, whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals, and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, E0 supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions, yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method to improve robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that E0 delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.
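
To make the phrase "quantized action tokens" concrete, here is a minimal sketch of one common way to turn a continuous control value into a discrete token and back. It assumes plain uniform binning over a fixed range; the abstract does not say how E0 builds its action vocabulary, so the function names, the range, and the bin count below are all hypothetical.

```python
def quantize(value: float, low: float, high: float, num_bins: int) -> int:
    """Map a continuous value to a bin index in [0, num_bins - 1].

    Assumption: uniform bins over [low, high]; out-of-range values are clipped.
    """
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    # The upper edge (frac == 1.0) would land in bin num_bins, so cap it.
    return min(int(frac * num_bins), num_bins - 1)


def dequantize(token: int, low: float, high: float, num_bins: int) -> float:
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width


if __name__ == "__main__":
    # Hypothetical setup: one joint command in [-1, 1] with 256 bins.
    token = quantize(0.3, -1.0, 1.0, 256)
    print(token, dequantize(token, -1.0, 1.0, 256))
```

With this scheme the round-trip error is at most half a bin width, so a larger vocabulary gives finer-grained control, at the cost of more tokens to model.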


PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy

Guan, Shuhao, Lin, Moule, Xu, Cheng, Liu, Xinyi, Zhao, Jinman, Fan, Jiexin, Xu, Qi, Greene, Derek

arXiv.org Artificial Intelligence

This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
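
For readers unfamiliar with the metric, the sketch below shows how a character error rate (CER) and a relative CER reduction are typically computed. It assumes the standard definition (Levenshtein edit distance divided by reference length); the abstract does not spell out the authors' exact evaluation protocol, and the function names and sample values are illustrative only.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction by which new_cer improves on baseline_cer."""
    return (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    print(cer("historical", "hist0rical"))
    print(relative_reduction(0.20, 0.06))
```

A reported reduction such as "63.9-70.3%" is a relative figure of this kind: it compares the pipeline's CER against the CER of OCR on the raw images, not an absolute drop in percentage points.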


Comparison with the vanilla SGD baseline

Neural Information Processing Systems

We thank the reviewers for their comments. We will carefully modify the paper according to the suggestions. Figure 1: Comparison of different learning schemes on RotMNIST classification and IWSLT translation tasks. For the NMT tasks, we used the same parameter settings from previous papers, as described in section 5.2. Assistant model shows similar performance over different batch sizes. However, we will provide results on the raw ImageNet dataset and a large Transformer model in the revised version.


We sincerely thank all three reviewers for their valuable comments, with the following being our responses

Neural Information Processing Systems

We sincerely thank all three reviewers for their valuable comments, with the following being our responses. As such, although the weights in Eq. (2) are learned to be 'static', the enriched The sample sizes for all the models are fixed as 4096. R1) Regarding the advance of the proposed model. We will include such experiments in our revised paper. R1) Regarding missing related work.